Fix LR scheduler behaviour with AMP #16229
base: master
Conversation
In the process of fixing tests I discovered and fixed a bug where the scheduler wouldn't match its optimizer when multiple optimizers are configured with frequencies. Now the optimizers and schedulers match and alternate as they should, resetting the cycle every epoch.
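For context, here is a minimal sketch (an illustration only, not code from this PR; the import path and module names are assumptions that may vary by Lightning version) of the kind of setup the fix targets: two optimizers registered through `configure_optimizers` with `frequency` entries, which the trainer is expected to alternate between while stepping the matching scheduler.

```python
import torch
import pytorch_lightning as pl


class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 2)

    def training_step(self, batch, batch_idx, optimizer_idx=0):
        # In multi-optimizer automatic optimization, the active optimizer index is
        # passed in; the loss here is a placeholder for illustration.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        opt_a = torch.optim.SGD(self.parameters(), lr=0.1)
        opt_b = torch.optim.Adam(self.parameters(), lr=0.01)
        sched_a = torch.optim.lr_scheduler.StepLR(opt_a, step_size=1)
        sched_b = torch.optim.lr_scheduler.StepLR(opt_b, step_size=1)
        return (
            # opt_a/sched_a should be active for 2 batches, then opt_b/sched_b for 1,
            # with the cycle restarting at every epoch.
            {"optimizer": opt_a, "lr_scheduler": {"scheduler": sched_a, "interval": "step"}, "frequency": 2},
            {"optimizer": opt_b, "lr_scheduler": {"scheduler": sched_b, "interval": "step"}, "frequency": 1},
        )
```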
@carmocca Ready for final review
            return
        active_optimizers = _get_active_optimizers(
-           self.trainer.optimizers, self.trainer.optimizer_frequencies, self.total_batch_idx
+           self.trainer.optimizers, self.trainer.optimizer_frequencies, self.batch_idx
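Roughly, the frequency-based selection behaves like the sketch below (a simplified illustration; `active_optimizer_index` is a hypothetical helper, not the actual `_get_active_optimizers` implementation). Passing the per-epoch `batch_idx` instead of the global `total_batch_idx` is what makes the frequency cycle restart at every epoch.

```python
from itertools import accumulate


def active_optimizer_index(frequencies, batch_idx):
    """Return which optimizer is active for this batch given cycling frequencies."""
    cycle_len = sum(frequencies)   # e.g. [2, 1] -> a cycle of length 3
    pos = batch_idx % cycle_len    # position inside the current cycle
    for i, boundary in enumerate(accumulate(frequencies)):
        if pos < boundary:
            return i


# With frequencies [2, 1]: batches 0, 1 -> optimizer 0; batch 2 -> optimizer 1; then repeat.
assert [active_optimizer_index([2, 1], b) for b in range(6)] == [0, 0, 1, 0, 0, 1]
```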
Could you add a test to verify this works properly?
I modified the third case of test_step_scheduling_for_multiple_optimizers_with_frequency so that it tests this behaviour.
Can you check the failing tests?
setup.cfg (outdated)
      cloud: Run the cloud tests for example
  filterwarnings =
      error::FutureWarning
+     error:Detected call of `lr_scheduler.step\(\)` before `optimizer.step\(\)`:UserWarning
I added this line so that our CI fails if this warning appears. This way it tests that your patch works as expected.
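For reference, the pytest `filterwarnings` entry above is roughly equivalent to the following use of the standard `warnings` module (shown only as an illustration; the message pattern is a regex matched against the start of the warning text):

```python
import warnings

# Escalate this specific UserWarning to an error so any code path that steps the
# LR scheduler while the optimizer step was skipped fails the run immediately.
warnings.filterwarnings(
    "error",
    message=r"Detected call of `lr_scheduler\.step\(\)` before `optimizer\.step\(\)`",
    category=UserWarning,
)
```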
Thanks, but this also makes the IPU tests fail, and this PR is focused on GPU. Not sure where to fix the issue on IPUs.
The way I fixed the …
I also modified …
Let me check what is missing here...
Is this PR merged already? I'm still having this issue.
There were some failing tests, @milesial mind having a look?
Codecov Report
❌ Patch coverage is 29%. The patch check failed because the patch coverage (29%) is below the target coverage (50%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@            Coverage Diff             @@
##           master   #16229     +/-   ##
=========================================
- Coverage      87%      48%     -39%
=========================================
  Files         269      266       -3
  Lines       23656    23606      -50
=========================================
- Hits        20633    11419    -9214
- Misses       3023    12187    +9164
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
What does this PR do?
When training with native AMP and an LR scheduler, we get a warning indicating that an LR scheduler step was taken while the optimizer step was skipped (expected at the beginning of training with native AMP): "Detected call of `lr_scheduler.step()` before `optimizer.step()`".
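Outside Lightning, the situation can be reproduced and worked around like this (a hedged sketch assuming a CUDA device, not the change made in this PR): `GradScaler.step()` skips `optimizer.step()` when it finds inf/NaN gradients, which is common while the loss scale calibrates, so the scheduler should only advance when the optimizer actually stepped.

```python
import torch

model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 4, device="cuda")).sum()
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    scaler.step(optimizer)   # may skip optimizer.step() on inf/NaN gradients
    scaler.update()          # on a skipped step, the scale is reduced
    if scaler.get_scale() >= scale_before:
        scheduler.step()     # only advance the LR schedule if the optimizer stepped
```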
Fixes #16228 #5558
Does your PR introduce any breaking changes? If yes, please list them.
No
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.